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Abstract 

The GMRES method is parallelized, and com- 
bined with local preconditioning to construct an 
implicit parallel solver to obtain steady-state solu- 
tions for the Navier-Stokes equations of fluid flow 
on distributed-memory machines. The new im- 
plicit parallel solver is designed to preserve the 
convergence rate of the equivalent ‘serial’ solver. 
A static domain-decomposition is used to parti- 
tion the computational domain amongst the avail- 
able processing nodes of the parallel machine. The 
SPMD (Single-Program Multiple-Data) program- 
ming model is combined with message- passing tools 
to develop the parallel code on a 32-node Intel Hy- 
percube and a 512-node Intel Delta machine. The 
implicit parallel solver is validated for internal and 
external flow problems, and is found to compare 
identically with flow solutions obtained on a Cray 
Y-MP/8. The computational speed on 32 process- 
ing nodes of the i860 machines is comparable to the 
speed on a single processor of the Cray Y-MP. A 
peak computational speed of 2300 MFlops/sec has 
been achieved on 512 nodes of the Intel Delta ma- 
chine, for a problem size of 1024K equations (256K 
grid points). 

Introduction 

Parallel machines based on distributed-memory 
architectures are providing viable alternatives to ex- 
pensive vector supercomputers for performing nu- 
merical integrations of large-scale Computational 
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Fluid Dynamics (CFD) problems. As CFD codes 
are combined with other codes to perform multidis- 
ciplinary work, severe pressures will be exerted on 
modern supercomputers — both in terms of mem- 
ory requirements and CPU time. 

The computational power of multiprocessor 
machines with a large number of processors can be 
harnessed to obtain solutions for the non-linear par- 
tial differential equations of fluid flow. CFD can 
choose between the wide variety of machine archi- 
tectures currently available — shared-memory ma- 
chines (e.g. Cray Y-MP), distributed-memory ma- 
chines (e.g. Intel Hypercube) or machines with hy- 
bird architectures. This paper concentrates on using 
a distributed memory, Multiple Instruction Multiple 
Data (MIMD), message-passing architecture to ob- 
tain solutions to the Navier-Stokes equations of fluid 
flow. 

The Navier-Stokes equations are discretized in 
space using an upwind, finite-volume, flux-split for- 
mulation. The discretized equations are linearized 
with an Euler-implicit linearization, and integrated 
in time to obtain a steady-state solution. The Euler- 
implicit linearization produces a system of simul- 
taneous linear equations characterized by a large, 
sparse, non-symmetric coefficient matrix. The work 
described in this paper concentrates on investigating 
parallel implementations of Krylov-subspace meth- 
ods, particularly GMRES 1 , for solving this linear 
system of equations at each time-step of the time- 
integration. 

In earlier work on parallel Krylov solvers, Saad 
& Schultz 2 combined GMRES with ILU precondi- 
tionings on shared-memory machines and loosely- 
coupled linear or mesh-connected arrays. O’Leary 3 
implemented the block Conjugate Gradient algo- 
rithm on a coarse-grained parallel machine. An- 
derson & Saad 4 examined the standard ILU(O) and 
polynomial preconditioners for shared-memory ma- 
chines. Their work concludes that ILU(O) may 
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outperform polynomial preconditioning if the num- 
ber of processors is small. Radicati & Robert 5 
used overlapped domain decomposition to imple- 
ment ILU preconditioning on a shared-memory mul- 
tiprocessor. Their numerical experiments showed 
that a local ILU factor on overlapping blocks is a 
good preconditioning strategy. Baxter et. al 6 ex- 
amined the performance of Krylov solvers with ILU 
preconditioning on a shared-memory machine and 
on Hypercube architectures. Application problems 
from reservoir engineering and mathematics were 
studied in their work. 

In more recent work on machines with large 
number of processors, Berryman et. al 7 have mea- 
sured the performance of key interactive kernels 
(sparse triangular solves, inner products etc.) of 
preconditioned Krylov solvers on the CM- 2 ma- 
chine. The ideas of multi-level domain decompo- 
sition for preconditioned Krylov solvers have been 
presented by Gropp and Keyes 8 . Shadid and 
Tuminaro 9 have implemented GMRES and other 
Krylov solvers on a 1024- node ncube/2 Hypercube. 
Their work has examined the efficiency of various lo- 
cal and global preconditioners, for several standard, 
model problems. Desturler 10 has proposed a mod- 
ified, computationally cheaper version of GMRES 
for medium to fine grained parallelism on MIMD 
machines. 

In summary, several authors have explored im- 
plementations of Krylov solvers on shared-memory 
and distributed-memory architectures. Most of the 
works have examined ‘model’ problems arising from 
elliptic and hyperbolic PDEs. The ILU factoriza- 
tion seems to be a popular preconditioner, as sev- 
eral efforts have been made to study its performance 
in the parallel environment. However, a comprehen- 
sive study of the performance of preconditioners and 
Krylov solvers for practical CFD problems appears 
to be lacking in the literature. 

This paper investigates the viability of using 
the preconditioned GMRES algorithm in a domain- 
decomposition framework to develop an implicit 
solver for solving CFD problems on distributed- 
memory machines. A new local preconditioner, 
which is much cheaper than the popular ILU pre- 
conditioner, has been developed to support efficient 
implementation of the parallel GMRES solver. The 
new preconditioner is based on symmetric Gauss- 
Seidel sweeps across each domain, and shows excel- 
lent scalability over a large range of processors. The 
implicit parallel solver is validated on a 32-node In- 
tel Hypercube and a 512-node Intel Delta machine. 

This introduction section is followed by a sec- 


tion describing some of the theory of parallel ma- 
chines (MIMD architectures in particular), CFD 
and Krylov solvers. Detailed results of computa- 
tions on an Intel Hypercube and Intel Delta ma- 
chines will then be presented. The paper will con- 
clude with some remarks about the present results 
and some ideas for future work in the area of im- 
plicit parallel solvers for CFD codes. 

Presentation of Theory 

Distributed-Memory Systems 

Multiple Instruction Multiple Data (MIMD) 
or distributed-memory machines are characterized 
by a grouping of processors which are capable of 
functioning independently as computational nodes. 
Each processor has individual memory, computa- 
tional units and communication units. Information 
is exchanged amongst processors by sending pack- 
ets of information or ‘messages’ from one proces- 
sor to another. Each processor has its own clock, 
and there is is no ‘global’ clock. The processors 
can be ‘synchronized’ by performing a ‘global’ com- 
munication. The connectivity between processors 
defines the topology of the machine and determines 
the speed at which messages are passed from one 
processor to another. 

The processors in a Hypercube architecture are 
interconnected with a cube-type connectivity ; each 
processing node lies on the vertex of an order-n 
cube. The 32-node Intel Hypercube machine at 
NASA Lewis Research Center is organized as a 
cube of order 5. In contrast, the processors for the 
512-node Intel Delta machine at CalTech are ar- 
ranged in a mesh-type connectivity (16*32 mesh). 
Both machines are based on the Intel i860 micro- 
processor, with 16 MBytes of memory per node. 
The i860 is a 40 MHz RISC microprocessor with a 
peak theoretical rating of 32 MIPS (integer perfor- 
mance) and 60 Mflops (64- bit floating-point perfor- 
mance). The communication networks for the Intel 
machines are characterized by relatively low com- 
munication bandwidths and high communication la- 
tencies. This implies that a few long messages are 
preferable to numerous short messages. Further de- 
tails of the Hypercube architecture may be found in 
reference 11. 

Navier-Stokes Equations 

The governing equations of compressible fluid 
flow in 2-D are the Navier-Stokes equations written 
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as 


^ + ^ + ^ = + (I) 

di dx dy dx dy 

where Q is the vector of independent conserved vari- 
ables, F and G are inviscid flux vectors and F v and 
G v are viscous flux vectors. The governing equa- 
tions are solved computationally in their integral, 
conservation law form in generalized coordinates, 
using a cell-centered finite volume formulation. The 
inviscid fluxes are up winded using Van Leer’s 12 flux- 
splitting scheme. The viscous fluxes are evaluated 
with second-order accurate central-differences. Ad- 
ditional details of the ‘serial’ code may be found in 
reference 13. 

The generalized-coordinate form of equation 1 
can be written in compact form as 


ldQ 
J dt 


= -R 


( 2 ) 


where J is the jacobian of the transformation from 
cartesian to generalized coordinates. R is called the 
residual vector, and equals to zero for a steady-state 
solution. The accuracy of the computed solution 
is directly affected by the accuracy of the residual 
vector computation. A nine-point stencil is used 
for second-order accurate calculations of the resid- 
ual vector. 

The Euler-implicit time-linearization of equa- 
tion 2 results in 


A Q n 
J At 


—R n+l 


( 3 ) 


where A Q n is the incremental change in the cell- 
centered values of the vector Q between the n + 1 th 
time level and the known n th time level, i.e. A Q n = 
<? n+1 — Q n . R n+l is linearized in time about the 
n th time level which results in 


< 4 > 

where is a block-diagonal matrix and is 
a large, sparse, block, banded non-symmetric ma- 
trix. In this paper, is evaluated with a five-point 
stencil (as compared to a nine-point stencil com- 
putationa for the R). This is done to reduce the 
computational and storage costs associated with a 
nine-point stencil evaluation of at the expense of 
increased time-steps required to reach a converged 
steady-state solution. 

Equation 4 can be rewritten in matrix-vector 
form as 


[V n ]{A Q n } = -{IT} (5) 


Equation 5 represents the system of simultaneous 
linear equations which has to be solved for A Q n at 
each time-step of the time-integration. For small 
(of order 100), well- conditioned coefficient matri- 
ces, the system may be solved exactly by inverting 
the matrix [ V n ] at each time-step. However, for 
large and/or poorly-conditioned matrices (found in 
practical CFD applications), an iterative solution 
of equation 5 becomes more viable. The precon- 
ditioned GMRES method, investigated in reference 
13, is parallelized to solve equation 5. Some issues 
related to the development of the parallel solver are 
now discussed. 


Parallel Domain- Decomposition 

The discretized Navier-Stokes equations can be 
solved in a parallel framework by decomposing the 
original, large uniprocessor domain of grid points 
into a number of smaller domains which are then 
distributed to the available processors (one domain 
per processor). In a distributed-memory system, 
each processor can only access information from its 
own local-memory. Thus, information may have to 
be exchanged across the domain interfaces (or pro- 
cessor boundaries) in order to preserve the charac- 
teristics of the original problem. 

Recall, that the residual vector computation 
uses a nine-point stencil. Thus, the flux-evaluation 
for cell-faces which he on (and adjacent to) domain 
boundaries will require information from (a maxi- 
mum of) two adjacent cells which reside in a neigh- 
boring processor. This information exchange is fa- 
cilitated by creating two layers of ‘ghost’ cells at 
each domain boundary. At each time-step, data 
from the neighboring domains is ‘communicated’ to 
these ‘ghost’ cells before the flux-evaluation rou- 
tines are invoked. This communication is done 
prior to the application of the explicit boundary 
conditions. The flux-balance evaluated by this ap- 
proach has been validated to be identical to the 
flux-balance computed for the original uniprocessor 
domain. Note that each domain must contain at 
least three cell-faces (in each coordinate direction) 
for this approach to work successfully. 

The implementation of boundary conditions at 
physical boundaries may also require communica- 
tion amongst processors. Airfoil calculations on C 
and O-type meshes require communication between 
non-neighboring processors in order to effect C and 
O-type periodicity. This is achieved by communi- 
cation amongst domains which he along the wake- 
cut line of the particular C or O-type grid. Note, 
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that the boundary condition routines are invoked 
only for those processors which contain actual phys- 
ical boundaries corresponding to the inflow, outflow, 
bottom and top planes of the original uniprocessor 
domain. This creates an imbalance in the work- 
load across the processors, since the ‘interior’ pro- 
cessors do not perform boundary condition calcula- 
tions. This imbalance is not significant since <1% 
of the total computer time is required for the bound- 
ary condition computations. 

The in viscid and viscous flux vectors, and the 
respective flux-jacobian matricesare first calculated. 
The individal flux-jacbian matrices are then as- 
sembled into the implicit, left-hand-side coefficient 
matrix, for each domain. The coefficient matrix 
is assembled from linear combinations of the flux- 
jacobian matrices. Each domain assembles its own 
individual matrix, and no extra communication is 
required for this computational step. A five-point 
stencil is used for the implicit operator, providing 
a sparse, banded, coefficient matrix with five block- 
diagonals. 

The Parallel GMRES solver 

The original, large, system of linear equations 
corresponding to the uniprocessor domain is thus 
transformed to a series of smaller linear systems, 
with one linear system for each processor. A pre- 
conditioned GMRES solver is used to iteratively 
solve each linear system. The original GMRES 
method is designed to iteratively solve linear sys- 
tems with non-symmetric coefficient matrices. In 
order to solve a linear system (say Ax = 6), the 
method seeks an approximate solution x* of the 
form X* = Xo -h 2jt, where xq is some arbitrary ini- 
tial guess to the exact solution x. The vector z * lies 
in the Krylov subspace, K(A, ro,k), defined by the 
matrix A , the unit-vector wi — tq/P (ro = b — Ax 0 , 
p = ||ro||) and the size of the Krylov subspace k . 

The GMRES solver will converge the fastest 
when each successive iterate x* minimizes the resid- 
ual norm, ||r*|| over the subspace K(A, r 0 , k). One 
major practical difficulty with GMRES is that when 
k increases, both storage and operation cost increase 
as 0(&) and 0(k 2 ), respectively. If the available 
storage is limited, the method may be restarted af- 
ter k sub-iterations, with x* replacing xo in step 1. 
The restart version is often used in practical prob- 
lems and is referred to as GMRES(k). 

The complete GMRES algorithm can be writ- 
ten as follows: 

1. For any starting vector xq, form r 0 = b — 


Ax o ; 0 = ||r 0 ||2 ; = r o /0 

2. Perform k steps of Arnoldi’s method 14 with wi 
as the initial vector to form k vectors which are 
successively orthogonal to Wi 

3. Find the vector (say Zf.) which minimizes ||rfc|| 
in the subspace defined by the vectors in step 2. 
Compute the solution x* = xo + z* 

The GMRES algorithm involves three basic lin- 
ear algebra operations — inner-products of vectors 
(steps 1 &: 2), saxpy operations (steps 1 & 3) and 
matrix- vector products (steps 1 & 2). Evaluation 
of the inner-products requires inter-processor com- 
munications since the local inner-products have to 
be accumulated across all the processors. This can 
be done by passing 2log2N messages across the N 
processors. The saxpy operation can be performed 
independently by each processor, since only local 
data needs to be manipulated. 

Each matrix-vector operation requires commu- 
nication of the ‘boundary’ elements of the particular 
multiplying vector to neighboring processors. The 
components that correspond to the cells lying on 
the the four boundaries of each domain are commu- 
nicated to ‘ghost’ cells of the neighboring domains. 
This is critical in order to reproduce the uniproces- 
sor matrix-vector product, i.e. the product resulting 
from multiplying the original, single-domain coef- 
ficient matrix with a given vector. The multiple- 
processor matrix-vector product is required to be 
identical to the uniprocessor matrix- vector product, 
at each sub-iteration of the GMRES solver. This is 
to ensure that the parallel GMRES solver (without 
preconditioning) has the exact convergence rate as 
the serial GMRES solver. 

Preconditioning the Linear System 

The rate of convergence of any iterative al- 
gorithm depends on the condition number, *2 (A), 
of the iteration matrix A and the distribution of 
singular- values of A. If k 2 (A) is large and/or the 
spectrum of singular- values of A is wide and scat- 
tered, the matrix A is poorly conditioned, and the 
underlying algorithm may converge slowly. Precon- 
ditioners improve the conditioning of the iteration 
matrix, and usually have a first-order effect on im- 
proving the convergence rate and overall efficiency 
of solvers based on GMRES-like algorithms. For- 
mally, a preconditioning matrix M transforms the 
original system Ax = b into 

M~ l Ax = M~ 1 b & Ax = b (6) 

The operation of M~ l on any vector (say u — 
Ax) is equivalent to the solution of a linear system 
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Mu = u, with M as the coefficient matrix. Such 
linear systems have to be solved repeatedly for each 
sub-iteration of the preconditioned GMRES algo- 
rithm. Any matrix M which produces easy-to-solve 
linear systems of the type Mu — u (e.g. M = Diag- 
onal of A), is a potentially efficient preconditioner. 

The costs associated with preconditioning can 
be enumerated as (i) Computation of the precondi- 
tioning matrix M, (ii) Linear system solves associ- 
ated with M, and, (iii) Additional storage for the 
matrix M, which may be of the order of storage re- 
quirements for the matrix A. The selection of an 
‘efficient’ preconditioner is motivated by the mini- 
mization of the afore-mentioned costs. 

Several different preconditioners that can be 
chosen are diagonal ( M = major diagonal of A), 
block-diagonal, incomplete L-U factorization (ILU) 
and block-ILU 15 — in increasing order of the cost 
to calculate and store M . Iterative methods used 
in existing CFD codes can also be used as effective 
preconditioners. A variant of the iterative scheme 
of Yoon and Jameson 16 is used locally in each do- 
main as a parallel preconditioner. This precondi- 
tioner, referred to herein as the LUSGS precondi- 
tioner, was validated with the serial GMRES algo- 
rithm in reference 13, and found to be much more 
efficient (in terms of CPU time and storage) and ex- 
tremely competitive (in terms of convergence rate) 
with the currently popular preconditioners based on 
ILU factorizations of the matrix A. 

The LUSGS preconditioner is applied to solve 
linear systems like [Ilf] {it} = {tt} as follows : 

1. Approximately factor [ M ] « [L][D] _1 [Cf] where 
[D]=major block- Diagonal of [M], [L]=([D] — 
lower-triangular part of [Hf]), [Lf]=([Z>] — upper 
triangular part of [M]). 

2. Invert the block-diagonal matrix [D] 

3. Sweep forward (bottom-left to top-right) to 
solve [L]{u} = {u} for {it}. 

4. Sweep backward (top-right to bottom-left) to 
solve [U]{u} = [£>]{£} for {it}. 

The LUSGS preconditioner described above is 
used without modification in the parallel implemen- 
tation. Consequently, the sweeps in steps 3 & 4 
(which are now partial sweeps limited to the con- 
fines of the individual domains) will not produce 
the same solutions as the serial algorithm. How- 
ever, this is found to have only marginal effects 
on the convergence of the parallel, preconditioned 
GMRES solver. The preconditioner performs excel- 
lently for rectangular or square domains, and the 
performance deteriorates slightly for high aspect- 
ratio domains. The LUSGS preconditioned GMRES 


solver is found to maintain its convergence charac- 
teristics to within 5% of the serial solver (for upto 
512 processors). This demonstrates the excellent 
scalability of the new solver. Note, that if the num- 
ber of processors equals the number of computa- 
tional cells, the LUSGS preconditioner is equivalent 
to a fully-scalable block-diagonal preconditioner. 

I/O and Memory Considerations 

This paper uses the Single Program Multiple 
Data (SPMD) model of parallel programming to rim 
identical copies of the code on all processors. Each 
processor performs its input/output operations in- 
dependently of the other processors. The fastest 
way of performing I/O operations on both iP SC/8 60 
systems is to read/write from/to the Concurrent 
File System (CFS). Each processor can access data 
from the CFS at a peak rate of 1.5 Mbytes per sec- 
ond. 

Three data files are required by each processor 
— an input parameter file, a grid file and a restart 
file (if restarting). In the current implementation, 
all processors read from a commonly shared parame- 
ter file and grid file, both of which reside on the CFS. 
Each processor determines its position in the global 
domain, and correspondingly extracts the relevant 
information from the parameter file and grid file. 
Each processor is provided its own unique restart 
file (if restarting), which it reads directly from the 
CFS. 

On completion of the user-specified time-steps, 
each processor outputs a solution file directly to 
the CFS. This invokes multiple writes to the CFS 
and may cause delays due to contention for the 
I/O nodes as all processors try and write to the 
CFS at the same time. Since the global solution 
is distributed across the various processors, a post- 
processing program has been written to assemble 
the global solution from the various output files. 
The global solution file can also be used to gen- 
erate restart information for any number of proces- 
sors. The post-processor can be incorporated into 
the CFD code itself, but has been avoided in favor 
of the increased flexibility afforded by the current 
approach. 

It must be mentioned that if the I/O operations 
are done to/fxom the front-end (or ‘remote host’) 
system (a Sun Sparc 10 Workstation), the wall-clock 
time of each run increases considerably. In addition, 
if intermittent solution files have to be output to the 
CFS (e.g. for an unsteady calculation), the I/O time 
can tend to dominate the overall wall-clock time. 
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The memory requirements for the implicit code 
are estimated at 320 words per grid point per pro- 
cessor. This includes storage for 10 search directions 
of the GMRES solver. 16 MBytes of RAM is avail- 
able on each processor of the Intel Hypercube and 
Intel Delta machines. In practice, a maximum of 
« 3000 grid points (corresponding to ^ 10 MBytes 
of RAM) could be assigned to each processor, when 
using 64-bit floating-point arithmetic. The remain- 
ing memory is assigned for data, performance mon- 
itoring tools, system software and communications 
software. Hence, the maximum problem size is re- 
stricted by the total available memory on the par- 
allel machine. 

Test Results and Discussion 

A parallel, preconditioned GMRES solver has 
been implemented for implicit solutions of the two- 
dimensional, upwind, finite- volume, Navier- Stokes 
equations. The global uniprocessor domain repre- 
senting the computational grid is partitioned among 
the processors of a distributed-memory machine. 
Each processor runs identical copies of the same 
computational code on different sets of data. The 
processors communicate with each other at several 
times during each computational time-step in order 
to exchange information. 

The parallel code has been developed on an 
Intel Hypercube with 32 processors. All code- 
development, testing and debugging, and perfor- 
mance optimization has been done on the Hyper- 
cube. The parallel code has been validated against 
the original serial code (which is run on a single 
processor of the parallel machine). Results from 
the parallel residual vector computation and par- 
allel GMRES solver have been validated indepen- 
dently over different domain decompositions, and 
found to be identical to the serial code. This ensures 
complete scalability of the domain decomposition al- 
gorithm and the unpreconditioned GMRES solver. 
The two problems selected for validation were low- 
speed flow over a backward-facing step and subsonic 
flow over a NACA 1406 airfoil. Subsequently, the 
parallel code was ported to an Intel Delta machine 
with 512 processors. All performance results pre- 
sented here are based on data obtained from com- 
putations on the Intel Delta machine. 

Parallel Code Validation 

The problem of computing low-speed flow lam- 
inar over a back ward- facing step was the first test 
case to validate the parallel solver for internal flow 


conditions. This flow problem illustrates the phe- 
nomena of flow separation and recirculation in in- 
ternal flows. All flow variables are second-order ac- 
curate, fully-upwinded in the stream wise direction, 
and third-order accurate, upwind-biased in the nor- 
mal direction. The implicit (left-hand-side) opera- 
tor is discretized in a first-order accurate manner. 
Excellent comparisons with experimental data of 
Armaly et. al 17 have been obtained for this prob- 
lem with the serial code 13 . 

The parallel validation was performed on 16 
nodes of the Hypercube, with a 4 * 4 decomposition 
of the 61*51 global grid. This partitioning assigns 
16 * 14 grid points to nodes 0-7 and 16 * 13 points 
to nodes 8-15. The uniprocessor grid is shown in 
Figure 1. A freest ream Mach number of = 0.1 
was specified. Four different flow conditions with 
Reynolds number of Re oo = 100, 200, 300 and 389 
were computed. Adiabatic, no slip boundary condi- 
tions were used on the top and bottom walls form- 
ing the boundaries of the channel, and on the lower 
portion (which defines the step) of the inflow bound- 
ary. For fully developed subsonic flow at the outflow 
boundary, three variables (p, u and v) were extrapo- 
lated and the pressure was determined by fixing the 
stagnation enthalpy. The parabolic velocity profile 
at the inflow boundary was simulated by imposing 
a profile of Reimann invariants. 

backward-facing step 

61x51 GRID 



Figure 1. Computational Grid for Backward- 
Facing Step 

Figure 2 shows Mach number contours obtained 
for the Re = 389 case. The nature and size of 
the separation and recirculation behind the step 
closely matches the physical description of the flow 
as obtained in the experimental data. The ratio 
of reattachment distance (Ar) to step-height (S) 
was found to be identical to the uniprocessor cal- 
culations, for all the four Reynolds’ numbers. The 
calculated Xr/S ratios also compared very well (to 
within ±5% accuracy) with the experimental data. 

Low Reynolds number, laminar, subsonic flow 
over a NACA 1406 airfoil was the second valida- 
tion test case for the parallel solver, for external 
flow conditions. The flow conditions correspond 
to a freest ream Mach number of M = 0.6, an- 
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Figure 2. Mach Number Contours for Backward- 
Facing Step 

gle of attack, a = 1.0°, and Reynolds number, 
Re = 5.0 * 10 3 . The computational grid was a l C’ 
mesh of 257 * 65 points, and is shown in figure 3. 
The far-field boundary was placed five-chords from 
the airfoil and points are clustered near the airfoil 
to resolve viscous gradients. All flow variables are 
third-order accurate, upwind-biased in the stream- 
wise and normal directions. 


SUBSONIC AIRFOIL 
257x65 GRID 



Figure 3. Computational Grid for Subsonic Air- 
foil 


The parallel validation for this case was per- 
formed on 32 nodes of the Hypercube with an 
8*4 partitioning of the 257 * 65 grid to yield 
65 * 17 grid points per node. Figure 4 is a plot 
of the computed steady-state pressure coefficient, 
C p , on the surface of the airfoil. The computed 
lift, drag, and pitching moment coefficients ob- 
tained were Cl =0.18148, Cd=0. 041703, and Cm = - 
0.023718, respectively. These coefficients compared 
exactly with those computed with the serial version 
of the code. 

Convergence Rate Comparisons 

A comparison of convergence rates on the serial 
and parallel machines reveals the effectiveness and 
accuracy of the parallel implicit solver. A compari- 
son for the backward-facing step test case is shown 
in figure 5. A constant Courant number of 50 has 
been used. It is clear that the parallel precondi- 
tioned GMRES solver retains the convergence char- 


SUBSONIC AIRFOIL 



Figure 4. Chordwise Distribution of C p 


acteristics of the serial solver. The differences in the 
serial and parallel solvers arise because of the local 
(instead of global) implementation of the parallel 
LUSGS preconditioner. The LUSGS sweeps were 
restricted to the individual processor domains, and 
no additional message-passing was done to repro- 
duce the uniprocessor LUSGS sweeps of the global 
domain. This seemed to have a negligible impact on 
the convergence rate of the parallel solver. 


BACKWARD FACING STEP 



Figure 5. Convergence histories for Backward- 
Facing Step 

Figure 6 traces the convergence rates for the 
subsonic airfoil calculations, for 1500 time-steps. 
The parallel preconditioned GMRES solver was 
tested on 16(8*2), 32(8*4), 64(16*4), 128(16*8) and 
256(32*8) nodes. A constant Courant number of 10 
was used for the comparisons with the serial solver. 
It can be seen in figure 6 that the convergence rate of 
the parallel preconditioned GMRES solver decreases 
as the number of processors increases. This decrease 
implies that the number of time-steps required by 
the parallel solver to attain an eight-order reduction 
in the I 2 norm of the residual vector, will increase 
slightly (5-10%) as compared to the serial solver. 
The decrease in convergence rate is negligible up to 
a four-order residual reduction, which is usually suf- 
ficient for most engineering problems. Thus, it can 
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be claimed that an implicit parallel code (includ- 
ing the preconditioner, the GMRES solver and the 
residual-vector computation) has been designed to 
perform consistently over a large number of proces- 
sors in a distributed-memory environment. 


SUBSONIC AIRFOIL 



Figure 6. Convergence histories for Subsonic Air- 
foil 


Parallel Performance Results 

The parallel code was run on a single node 
of the Hypercube to determine the single-processor 
performance. The code was compiled with the max- 
imum available vectorization and optimization op- 
tions. The 61*51 grid from the backward-facing step 
calculation was used, as this was the largest number 
of points that could be accommodated on a single 
node (in accordance with memory requirements of 
the CFD code). The preconditiond GMRES solver 
was run for 100 time-steps, with 5 sub-iterations 
per time-step. The average single-processor CPU 
time for the Intel Hypercube was recorded as 340 
seconds. The total number of floating-point oper- 
ations were determined by invoking the hardware 
performance monitor (hpm) of the Cray Y-MP. The 
hpm indicated that the code performed 1300 Mflops, 
which translated to a 1300/340=3.8 Mflops/sec rat- 
ing for a single processor of the Intel Hypercube. 
The unpreconditioned GMRES solver performed at 
a higher rate of 5.9 Mflops/sec on a single node of 
the Intel Hypercube (1290 Mflops, 220 seconds, 100 
time-steps, 10 sub-iterations per time-step). 

The slower performance of the LUSGS precon- 
ditioned GMRES solver can be attributed in part 
to the inherent lack of vectorization of the LUSGS 
preconditioner. However, in practice, the faster vec- 
tor performance of the unpreconditioned GMRES 
solver was sufficiently compensated for by the much 
superior steady-state convergence rate of the pre- 
conditioned solver. The use of the LUSGS precon- 


ditioner considerably reduced the CPU time to at- 
tain a steady-state solution 13 . This suggests that 
the LUSGS preconditioner can be used profitably 
in a parallel environment, provided the convergence 
rate does not suffer as the number of processors is 
increased. 

Recall that the i860 processor is rated at 
60 Mflops/sec for 64-bit floating-point operations. 
Hence, when running at 5.9 Mflops/sec, only 10% 
of the peak performance is achieved (by the un- 
preconditioned GMRES solver) on a single node. 
These performance numbers seem to be very low, 
but they compare very favorably with other typi- 
cal CFD applications 18 on machines built around 
the i860 microprocessor. As a comparison, the 
unpreconditioned and LUSGS-preconditioned GM- 
RES solvers performed at rates of 120 Mflops/sec 
and 170 Mflops/sec, respectively, on a single proces- 
sor of the Cray Y-MP located at the NASA Lewis 
Research Center. 

The CPU times for the subsonic airfoil calcu- 
lation are plotted in figure 7. This figure demon- 
strates that a parallel implementation on 32 nodes 
can match the turnaround time of a serial imple- 
mentation on a single processor of a Cray Y-MP. In 
addition, a parallel implementation on 256 nodes is 
three times faster than a Cray Y-MP implementa- 
tion, in terms of CPU time to convergence. Note, 
that for larger problem sizes, the potential gain in 
CPU time with 256 nodes is much larger. This is 
because the ratio of computational work to commu- 
nication work increases with problem size, and each 
processor is utilized more efficiently. 


SUBSONIC AIRFOIL 



Figure 7. CPU time Characteristics for Subsonic 
Airfoil 

Figure 8 shows a profile of the run-time char- 
acteristics for the parallel code on 32(8 * 4) of the 
Intel Hypercube, for the 257 * 65 grid airfoil prob- 
lem. The computational work is in slight imbal- 
ance because the nodes which process the boundary- 
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condition information perform more work than the 
‘interior’ nodes. The communication load is in slight 
imbalance because the number and length of mes- 
sages varies across each node (e.g. boundary nodes 
have fewer messages). However, this does not have 
an effect on the overall load balance since the com- 
munication time is only a small fraction («5%) of 
the total CPU time. 
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Figure 8 . Run-Time Profiling Characteristics (32 
nodes) 

In this work, three grids of dimensions 193*161, 
257*257 and 513*513 were employed to study the 
effects of computational load on parallel perfor- 
mance. The backward-facing step problem was cho- 
sen as the test case. CPU times for 100 time-steps 
(10 sub-iterations per time-step) of the unprecon- 
ditioned GMRES solver were recorded. The per- 
formance for each grid is summarized in figure 9. A 
peak performance corresponding to 2300 Mflops/sec 
(512 nodes) is achieved for the 513*513 grid. The 
256*256 and 193*161 grids have reduced peak per- 
formances of 1350 Mflops/sec and 750 Mflops/sec, 
respectively. 


MFlops vs. # of Processors 



Figure 9. Variation of Performance vs. Load 

The ‘ideal’ performance (figure 9) is based on 
the single-node performance of 5.9 Mflops/sec for 
a domain size of « 3000 points. This implies that 


for a fixed size of the uniprocessor grid, the per- 
formance will be less than ‘ideal’ as the number 
of processors increases (and the domain size de- 
creases). This is evident from the performance 
numbers for the 193*161 grid, which deteriorate 
rapidly from 5.3 Mflops/sec/node (16 nodes) to 
2.32 Mflops/sec/node (256 nodes). However, when 
the grid size increases to 257*257, the performance 
numbers range from 5.6 Mflops/sec/node (32 nodes) 
to 3.4 Mflops/sec/node (256 nodes). The 513*513 
grid performs in the range of 5.2 Mflops/sec/node 
(128 nodes) to 4.8 Mflops/sec (256 nodes). 

The parallel efficiency for N processors, 77 #, is 
calculated as 

2i 

VN NT n 

where T\ is the CPU time for one processor and 
Tjv is the CPU time for N processors. The parallel 
efficiency characteristics for different grid sizes are 
shown in figure 10. As expected, 77 # decreases as 
the number of grid points per processor decreases. 
It is estimated that « 1024 grid points per node 
are required to keep rjw above 80%. Speedup fac- 
tors for the different grids and processors can be 
derived from the values of 77 ^ by using the rela- 
tion Sx = 7 ]n * N. For example, for the 513*513 
grid, when N — 512 and 77 jv= 0 . 74 , Sjy = 379. It 
must be remarked that the maximum speed-up for 
a fixed-size problem is governed by Amdahl’s law 19 , 
and depends on the distribution of sequential and 
parallel work in the code. 


Efficiency vs. # of Processors 



Figure 10 . Variation of Efficiency vs. Load 

Conclusions and Future Work 

A parallel, implicit solver has been developed 
for distributed-memory parallel machines, for ob- 
taining steady-state solutions of the compressible 
Navier-Stokes equations with a state-of-the-art CFD 
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code. The implicit solver is a combination of a paral- 
lelized Krylov solver (GMRES) and a scalable, local 
parallel preconditioner. This paper shows that the 
parallel, implicit solver provides steady- state con- 
vergence rates which compare excellently with se- 
rial implicit solvers used in shared-memory imple- 
mentations. The domain-decomposition strategies 
adopted in this paper are validated for internal and 
external flow test cases, on a wide range of process- 
ing nodes. 

The performance of the parallel CFD code 
varies as a function of the computational workload 
and the communication overhead for each proces- 
sor. The parallel efficiency (defined as ratio of ac- 
tual speedup to ideal speedup) is found to decrease 
as the amount of computational workload (or num- 
ber of grid points) per processor decreases. The 
parallel CFD code peaks at a computational rate of 
2300 Mflops/sec on a 513 * 513 grid on 512 nodes 
of the Intel Delta machine. A parallel efficiency of 
80% or greater is achieved if each processing node is 
assigned at least 1024 grid points. The parallel im- 
plementation is determined to be memory-bound, as 
a maximum of 3200 grid points can be accomodated 
in the 2MW RAM of each processor. The communi- 
cation overheads are determined to be largely inde- 
pendent of the nature of the domain decomposition 
and the assignment of domains to processors. The 
total communication time constitutes roughly 5-7% 
of the total execution time. 

The attainable single-node performance on the 
Intel machines (Hypercube or Delta) is 30 times 
lower than that on a single processor of a Cray Y- 
MP (6 Mflops/sec versus 170 Mflops/sec). How- 
ever, a subsonic airfoil calculation on 256 nodes 
is demonstrated to run three times faster than a 
single-processor Cray Y-MP computation. Consid- 
erable improvements in the areas of compilers, data 
cacheing, memory-access times and I/O operations 
are required to further enhance the competitiveness 
of parallel machines for large, three-dimensional, 
unsteady- flow simulations of fluid-flow problems. 
Improvements in parallel algorithms, solvers and 
programming models will also contribute to the ac- 
ceptability of parallel machines in widespread CFD 
applications. 

The implicit, parallel CFD code developed 
in this paper is being integrated into a multi- 
disciplinary design environment. Efforts are cur- 
rently underway to parallelize the turbulence mod- 
els and sensitivity-analysis algorithms, to obtain 
design-sensitivities for a large number of design 
variables in parallel. Recent Krylov solvers (CGS, 


BiCGSTAB and QMR) are also being studied to 
evaluate their competitiveness in parallel environ- 
ments. 
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